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1. Introduction 

As storage sites grow in size to thousands of 
disks, and as the need to predict availability and 
reliability increases, researchers and designers 
need a better quantitative understanding of the 
ways that disks fail or lose data. Unfortunately, 
these numbers are hard to come by. Disk 
manufacturers have some approximations to 
these numbers from their own (mostly 
accelerated) testing and through customer 
returns. Those data are not directly relevant to 
what their customers observe when deploying 
systems. There are several reasons for this. 

• Their focus is not historical; rather, they 
intend to determine quality of current 
products (and the budget for warranty funds). 

• Most quality tests are based on artificially 
accelerated testing that tries to predict the 
failure behavior of disks over their economic 
hfespan through tests that last at best a few 
months. 

• Forensic (non-accelerated) quality data do not 
address silent data corruption (bit-rot). 

• A large number of returned disks show no 
problem, though the customer problem that 
precipitated the disk return was real. 

Some large disk customers, such as Network 
Appliance [ES03], [SE04], [SE05] have failure 
data; however, most customers do not 
systematically gather data, because IT groups are 
in the business of operating storage systems, not 
data-gathering about them. As a result, much 
academic and corporate research is based on 
anecdotes and back of the envelope calculations, 
e.g. in Schwarz et al. [S&al04]. 
The Intemet Archive is making the first public- 
domain attempt to obtain and disseminate data 
about disk failures and disk data loss. The 
activity has several goals. Historically the first 



goal was to use already collected data on disk 
failure over several years at several installations 
to determine failure rates and thereby better 
manage the Archive data assets. Very shortly 
into our investigation, it became obvious that 
data collection should be planned. Planned 
measurements give better consistency and easier 
evaluation. Consequently, the idea of an Archive 
Observatory emerged. We are currently 
planning experiments to test for bit rot on a large 
number of drives that became available when the 
Intemet Archive switched to better enclosures. 

2. Internet Archive 

The Internet Archive is a nonprofit organization 
based in San Francisco established to preserve 
Web sites by taking "regular snapshots". It has 
now extended its mission to preserve as much 
digital and digitalizable data as possible. Its data 
set currently grows at about 25 TB/month. Its 
Wayback machine not only allows future 
historians to access data that would otherwise 
not be archived, but has already become a daily 
tool for investigating such things as trademark 
disputes. It complements its small staff with 
highly dedicated and educated volunteers. 
The Internet Archive stores its archival data at 
several sites, including sites in Alexandria, 
Egypt, Amsterdam, Netherlands, and several 
sites in San Francisco, CA. For cost reasons, it 
stores data on desktop ATA disks. These are 
now located in four-disk pizza-box-form-factor 
nodes that replace ones with a bulkier form 
factor. Of the four disks, one is the Linux OS 
boot disk, while the others only store data 
configured as a JBOD. There are 40 storage 
nodes in a rack and the San Francisco clusters 
have 36 racks. Over time, the enviroirniental 
quality of sites has differed. At one point, a 



main site did not have any air-conditioning, but 
those servers have now been transferred to a 
much more adequate location. 
The workload of the boot disks is similar to other 
server applications, whereas disks that only carry 
data have a small write load and a usually small 
but occasionally intense read load. The Intemet 
Archive stores most of its current data in 
-lOOMB archive files (ARC files). These files 
are written once and very rarely accessed. 
However, there are extreme hotspots caused by 
demand for certain data. For example, after the 
big Indian Ocean tsunami of 2005, videos of that 
event were uploaded and subsequently became 
very popular downloads. As another example, 
when the music group The Grateful Dead 
changed its policy on allowing fans to offer 
personal recordings of concerts on their own 
websites, the change triggered "panic" music 
downloads from the Archive. 
As an archive, the Internet Archive is interested 
in storing all of its data. Some data is more 
important than others and is either mirrored or 
triplicated over several systems. Other data, 
such as data from the Wayback machines is not 
so important since crawls at different times read 
and archive often the same web pages. 
Therefore, only one copy is stored. If this is lost, 
then the state of that website cannot be 
reconstructed, though most of the content of the 
lost visit is found in the ARC file of the previous 
or subsequent visit. To assess the integrity of its 
data, the Intemet Archive calculates and stores 
the MD5 hash value of each archive file. It 
periodically accesses the files to recalculate the 
MD5 hash. In some instances, these diagnostic 
accesses are the majority of the accesses seen by 
a file. In addition, the Archive also stores 
SMART statistics and kemel logs. 
One set of past data, mainly from American 
installations, covers 4973 disks. It will soon be 
augmented by more recent data from 2005 and 
2006. In addition, there is a collection of manual 
log entries recording when disks were taken out 
of service and what data located on them was 
"evacuated" for reconstruction and storage. The 
logs also have anecdotal comments. All digitally 
available data is made freely available at the 
Internet Archive. In addition, our results are 
public and will be posted on the Intemet Archive 
website [IA06]. 

3. Goals and Methodology 

Much of our work is currently concentrated on 
using archival failure data to determine disk 



failure rates and bit rot rates as a function of 
environmental factors, disk age and disk vintage. 
Trying to make sense out of the data set taught 
us how hard the analysis is. Disk failure, sector 
failure, and bit-rot (silent corruption) rates are a 
quotient of occurrence counts over total service 
times. Both numerator and denominator are 
hard to determine. At the level of the user / 
system administrator, disk failures, file system 
failures, controller failures, bus failures, etc. all 
appear as a generic failure. Designing and 
hardening disk based systems, however, requires 
understanding the failure types in greater detail. 
For example, we observe instances where a disk 
behaves erratically; but, when moved to another 
node the disk functioned well. There are at least 
four different possible root causes for this 
behavior, and each one has very different 
implications for the interpretation of the data. 
First, the original disk mounting failed, allowing 
the disk to vibrate. When moved, the new 
mounting was secure. Second, the original 
interface cable (or controller) for the disk was 
damaged. Any disk that replaced the original 
disk will unfortunately also fail intemnittently. 
Third, the disk has partially failed. When 
moved, new data is written to the disk in 
different places and / or the faiUng locations 
were remapped. Fourth, the disk was indeed 
behaving erratically. 

Gathering accurate population data is even more 
difficult. Disks are identified by a serial number, 
though there were about a dozen instances where 
the serial number was a fake zero identifier. 
While in principle disks remained in the node in 
which they were initially mounted, some 
problems were taken care off by swapping disks. 
Manual logs were not always accurate, 
understandably, since disk swaps are typically 
triggered by failures. 

SMART makes a large set of environmental, 
error and other metrics available. We used the 
Linux smartctl utility to interpret the SMART 
data [SCTL]. SMART data interpretation and 
format can vary between manufacturers (three in 
the case of the Internet Archive) and some 
SMART counters (such as the disk power-on 
lifetime counter) consist of only 16 bits that rolls 
over after 45.5 days on a considerable part of our 
population. Nevertheless, SMART allows us to 
identify whether a disk has been in continuous 
use (and not taken out and later returned to 
service), to gather environmental data, and most 
importantly, to analyze the condition of a disk 
through SMART error codes. 



To test for bit-rot, MD5 digests were used on 
complete files. MD5 is a widely used 
cryptographic hash function with a 128 bit hash 
value (see RFC 1321). An MD5 digest changes 
its value with extremely high probability if the 
file changes. 

As with almost all real world data, our 
measurements have gaps or periods without data. 
Gaps may be caused physically, for example by 
powering down a system. Gaps may also be 
caused by losing or corrupting the actual 
SMART or MD5 measurement data. As such, 
tracking a disk over time is difficult. We 
developed the notion of "chains" in order to deal 
with this A chain is a set of measurements of a 
single object that are contiguous, where with 
"contiguous" we mean a measurements taken so 
close to each other that we can infer that the disk 
or file stayed in place. We then developed 
database schemes in order to capture data on 
MD5 and SMART chains. All statistics can then 
be computed over both the cumulative chain time 
or, occasionally, cumulative wall clock time. We 
can also preferentially select short or long chains 
for analysis. All of this work is also in 
preparation for the Archive Observatory, our 
instrumentation of the Internet Archive storage 
system. 

4. Preliminaty Results 

4.1. Bit-rot Data 

We gathered and analyzed data on MD5 that 
give some insight into the occurrence of bit-rot. 
The same data were used in an assessment of 
long-term reliability of digital storage [B&al06]. 
The state history of each archive file is 
represented by a series of MD5 digests. A total 
of 112,865,205 MD5 digests for about 4,717,158 
files are available, though so far only a small 
fraction (1,496,572 records representing 36,169 
files) has been analyzed. Not all changes of an 
MD5 digest indicate actual bit-rot. Often, a 
transient change of an MD5 digest seemed to be 
caused by a temporary inability to access the file. 
Out of 10 disks where a different MD5 value for 
a file was recorded, only two were ultimately 
attributed to likely disk failure. Sometimes, the 
MD5 digest switches to a new value but then 
returns to the original value, most likely the 
result of a problem with the node, controller, 
bus, or processor computing the MD5, but not 
with the disk if other files on the same drive do 
not experience MD5 digest changes. 



4.2. SMART Data 

We collected the SMART data in a single 
database. In a separate log that we have not yet 
succeeded in integrating, we have data on the 
load for each block device, including read and 
write operations per second. 
Our current data do not yet allow us to 
characterize why disks leave the system. A disk 
that powers down and does not restart (i.e. a 
classical disk failure) and a disk that is 
decommissioned look the same to our database. 
Currently, we have 93,385 chains. 470 of these 
chains exhibit failure, sometimes reflecting 
oscillation between failure and normal state. We 
call this behavior bouncing. These data represent 
total measurements of 1,039,516.27 days. In 18 
chains, SMART indicated a failure free state 
followed by a failure state for the rest of the 
chain. 6 of these disks left the population. A total 
of 36 out of 4,978 disks (0.7%) exited the 
population after exhibiting SMART indicated 
failures. If a disk powers down and then does not 
come up, we only see a disk exiting the 
population; we are not able to say whether the 
exit was planned or due to failure. 
SMART data indicate either a healthy or failing 
state of a disk. We observed 136 "oscillating" 
chains that change state more than once (out of 
the 470 chains with failure). However, 94 of 
these seem to have a spurious indication of a 
healthy state. The remaining 42 oscillating 
chains represent 27 different disks out of a 
population of 4978. In one case only, the disk 
was moved to another node, and in this case, the 
disk continued to form an oscillating chain. 8 
out of the 27 disks seemed to have left the 
population permanently and prematurely. All 8 
formed oscillating chains, typically 2-3. In these 
cases, SMART gave failure warnings before the 
disk actually stopped working or before it was 
pulled from the machine. 

Our next step is to correlate SMART data with 
data on bit-rot. We have identified a number of 
disks that housed files whose MD5 either 
changed or oscillated. Our database allows us to 
correlate these events and SMART reports of 
physical conditions. 

The Internet Archive has logs that record disks 
being removed from nodes because of instability 
or failure. With our database, we are now in the 
position to access SMART data for these drives 
before their removal. 



4.3. Disk Failure Rates 

Based on maintenance records, the Archive does 
not see systematic disk infant mortality. 
However, it experienced in 2005 two incidents of 
high failure rates in two unrelated batches of 
disks. The manufacturer attributed this to 
shipping or handling mistreatment. Our Archive 
Observatory has now data for 2489 disks for 
almost the complete year 2005. Disk failures 
appeared at a rate of 2.00%, but the data also 
show a combined failure rate of 2.54% for 
motherboard, CPU, memory, et cet. failure. In 
this period, almost half the replacements were 
due to disk drive failure, the other replacement 
were due to substantial sector failure. 
Replacements in the past have been as high as 
6%. The lower values now observed are in line 
with [GI05]. 

5. Future Work 

Obviously, much work still needs to be done 
before our initial target, is completed. At the 
same time, our efforts have shown the need to 
gather data in a more systematic way, i.e. the 
setting up of the Archive Observatory. We are 
also starting bit-rot measurements. For our first 
set of experiments, we will be using sixty nodes 
with up to 500 decommissioned drives and 
simulate an archive (write once, read often) 
environment. We will then measure the amount 
of bit-rot observed for sixty days. Actually, this 
experiment will exclude certain causes of bit-rot 
such as wayward writes (that overwrite data on a 
neighboring track). To our knowledge, this is 
the first large-scale attempt to observe bit-rot in 
the laboratory. It expands on the experiment 
done by Gray and v. Ingen reported in [GI05], in 
which many fewer errors were observed than 
predicted by the drive manufacturer 
specifications. Currently, at the beginning of 
April 2006, almost a PB of data has been stored 
and checked; no genuine incident of bit-rot has 
been discovered, but a number of false 
initializations have been seen [IA061. 
Even if we completely succeed in analyzing disk 
failure behavior at the Internet Archive, we only 
obtain a single data point. We would like to 
invite other institutions with large disk 
populations to join the effort to measure actual 
disk failure behavior. Ultimately, we hope to 
increase the openness about failure rates and 
modes actually seen in practice. 
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